The activations are binarized following previous works as:
\[
\hat{X}^i_B = \operatorname{Sign}(X^i_R) =
\begin{cases}
-1, & \text{if } X^i_R < 0 \\
+1, & \text{if } X^i_R \geqslant 0
\end{cases}
\tag{5.37}
\]
In that case, $\hat{X}_B^T \hat{X}_B = n_{X_R}$, where $n_{X_R}$ is the number of elements in $X_R$, and $\alpha^*$ can be solved as:
\[
\alpha^* = \frac{X_R^T \hat{X}_B}{n_{X_R}} = \frac{\|X_R\|_{\ell 1}}{n_{X_R}}
\tag{5.38}
\]
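For concreteness, the following PyTorch-style sketch implements Eqs. (5.37)-(5.38); the function name and tensor handling are illustrative assumptions, not the authors' reference code.

```python
import torch

def binarize_sign(x_r: torch.Tensor):
    """Sketch of Eqs. (5.37)-(5.38): {-1, +1} binarization with the
    closed-form scale alpha* = ||X_R||_l1 / n_XR (illustrative only)."""
    # Sign with the convention Sign(0) = +1, as in Eq. (5.37)
    x_hat_b = torch.where(x_r >= 0, torch.ones_like(x_r), -torch.ones_like(x_r))
    # alpha* = ||X_R||_l1 / n_XR, i.e., the mean absolute value of X_R
    alpha = x_r.abs().mean()
    return alpha * x_hat_b, alpha
```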
For the activations in attention layers or after the ReLU non-linearity layers, with $X_R \in \mathbb{R}^n_+$, the authors binarized the activations to $\hat{X}_B \in \{0, 1\}^n$ by rounding the real-valued activations:
\[
\hat{X}^i_B = \left\lfloor \operatorname{Clip}(X^i_R, 0, 1) \right\rceil =
\begin{cases}
0, & \text{if } X^i_R < 0.5 \\
1, & \text{if } X^i_R \geqslant 0.5
\end{cases}
\tag{5.39}
\]
In that case, $\hat{X}_B^T \hat{X}_B = n_{\{X_R \geqslant 0.5\}}$, where $n_{\{X_R \geqslant 0.5\}}$ denotes the number of elements in $X_R$ that are greater than or equal to 0.5. Then $\alpha^*$ can be solved as:
\[
\alpha^* = \frac{\left\| X_R \cdot \mathbb{1}_{\{X_R \geqslant 0.5\}} \right\|_{\ell 1}}{n_{\{X_R \geqslant 0.5\}}}
\tag{5.40}
\]
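A matching sketch for the $\{0, 1\}$ case of Eqs. (5.39)-(5.40) is shown below; the guard against an all-zero mask is an added assumption for numerical safety.

```python
import torch

def binarize_nonneg(x_r: torch.Tensor):
    """Sketch of Eqs. (5.39)-(5.40): {0, 1} binarization for non-negative
    activations such as attention scores or post-ReLU features."""
    # Round-to-nearest after clipping to [0, 1], Eq. (5.39)
    # (torch.round breaks ties to even; negligible for this sketch)
    x_hat_b = torch.round(torch.clamp(x_r, 0.0, 1.0))
    # n_{X_R >= 0.5}: number of elements rounded to 1 (clamp avoids division by 0)
    n_pos = x_hat_b.sum().clamp(min=1.0)
    # Eq. (5.40): l1 norm of the elements at or above 0.5, over their count
    alpha = (x_r * x_hat_b).abs().sum() / n_pos
    return alpha * x_hat_b, alpha
```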
5.10.2 Elastic Binarization Function
The fixed scaling factor and threshold derived above work reasonably well, but they might not be optimal, since they ignore the distribution of the variable being binarized. Ideally, these parameters would be learned during training to minimize the target loss.
When using classical binarization, i.e., $\hat{X}^i_B = \operatorname{Sign}(X^i_R)$, the binary output is independent of the scale of the real-valued input. However, in our case, where $\hat{X}^i_B = \lfloor \operatorname{Clip}(X^i_R, 0, 1) \rceil$, this independence no longer holds. Learning the scaling and threshold parameters, and approximating the gradients precisely in the process, therefore becomes crucial for the final accuracy.
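This scale dependence can be seen with a tiny numerical check (the activation values and scales below are made up for illustration):

```python
import torch

x = torch.tensor([-0.3, 0.2, 0.4, 0.8])   # hypothetical activations
for c in (0.5, 1.0, 4.0):                  # hypothetical input scales
    sign_out = torch.sign(c * x)                           # identical for every c > 0
    round_out = torch.round(torch.clamp(c * x, 0.0, 1.0))  # changes with c
    print(c, sign_out.tolist(), round_out.tolist())
```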
To handle this, the authors proposed the elastic binarization function to learn both the scale $\alpha \in \mathbb{R}^+$ and the threshold $\beta \in \mathbb{R}$:
\[
X^i_B = \alpha \hat{X}^i_B = \alpha \left\lfloor \operatorname{Clip}\!\left( \frac{X^i_R - \beta}{\alpha},\, 0,\, 1 \right) \right\rceil
\tag{5.41}
\]
In this function, $\alpha$ is initialized with $\alpha^*$ from Eq. (5.38) and $\beta$ with 0, and both are trained with gradients from the final loss. To back-propagate the gradients to $\alpha$ through the discretized binarization function, the straight-through estimator (STE) [9] is leveraged, passing the incoming gradients of the round function straight through as its outgoing gradients:
\[
\frac{\partial X^i_B}{\partial \alpha}
= \hat{X}^i_B + \alpha \frac{\partial \hat{X}^i_B}{\partial \alpha}
\overset{\mathrm{STE}}{\approx} \hat{X}^i_B + \alpha \frac{\partial \operatorname{Clip}\!\left( \frac{X^i_R - \beta}{\alpha}, 0, 1 \right)}{\partial \alpha}
=
\begin{cases}
0, & \text{if } X^i_R < \beta \\[4pt]
\dfrac{\beta - X^i_R}{\alpha}, & \text{if } \beta \leqslant X^i_R < \alpha/2 + \beta \\[4pt]
1 - \dfrac{X^i_R - \beta}{\alpha}, & \text{if } \alpha/2 + \beta \leqslant X^i_R < \alpha + \beta \\[4pt]
1, & \text{if } X^i_R \geqslant \alpha + \beta
\end{cases}
\tag{5.42}
\]
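One way to realize Eqs. (5.41)-(5.42) is a custom autograd function, sketched below. Only the $\alpha$ gradient follows Eq. (5.42) directly; the gradients with respect to the input and $\beta$ are filled in with the same clip-range STE as an assumption of this sketch, and $\alpha$, $\beta$ are assumed to be scalar (0-dim) tensors.

```python
import torch

class ElasticBinarization(torch.autograd.Function):
    """Sketch of the elastic binarization function, Eq. (5.41), with the
    STE-based alpha gradient of Eq. (5.42). Illustrative, not reference code."""

    @staticmethod
    def forward(ctx, x_r, alpha, beta):
        t = (x_r - beta) / alpha
        x_hat_b = torch.round(torch.clamp(t, 0.0, 1.0))   # Eq. (5.41)
        ctx.save_for_backward(t, x_hat_b)
        return alpha * x_hat_b

    @staticmethod
    def backward(ctx, grad_out):
        t, x_hat_b = ctx.saved_tensors
        inside = (t >= 0) & (t < 1)          # region where Clip is locally linear
        zeros = torch.zeros_like(t)
        # Eq. (5.42): dX_B/dalpha is x_hat_b - t inside the clip range and
        # x_hat_b (0 or 1) outside it.
        grad_alpha = (grad_out * (x_hat_b + torch.where(inside, -t, zeros))).sum()
        # Assumed STE for the input and beta: pass gradients through only
        # where the clip is not saturated.
        grad_x = grad_out * torch.where(inside, torch.ones_like(t), zeros)
        grad_beta = -grad_x.sum()
        return grad_x, grad_alpha, grad_beta
```

In use, $\alpha$ would be a learnable parameter initialized with $\alpha^*$ from Eq. (5.38) and $\beta$ a learnable parameter initialized to 0, applied as `ElasticBinarization.apply(x_r, alpha, beta)` in a module's forward pass.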